This notebook analyzes the daily time series of Citi Bike rides in NYC after Covid, from March 1, 2022 to March 31, 2024. The goal is to forecast NYC bike demand so that Citi Bike can better optimize bike distribution and maintenance planning.
The final dataset includes the bike demand variable “number_of_rides” and numerical weather variables; the weather variables serve as predictors in some of the models below.
Note: I aggregated 3.3 million bike ride records over two years into a daily time series and joined it with averaged daily weather data. These pre-processing steps are omitted for brevity.
Investigated the characteristics of the time series, such as variance, trend, seasonality, and stationarity.
Experimented with 6 types of models, optimized model parameters, and conducted model diagnostics based on several criteria, such as AICc and residual behavior.
Model 1: SARIMA (single seasonality)
Model 2: Linear Regression (weather predictors)
Model 3: Regression with ARIMA Errors (weather predictors)
Model 4: Dynamic Harmonic Regression (multi-seasonality)
Model 5: TBATS (complex seasonality)
Model 6: Intervention Analysis (no intervention effect from Covid-19)
Selected the Dynamic Harmonic Regression model as the best model, given the multi-seasonal nature of the data, and evaluated its forecast accuracy via cross-validation using various error metrics.
Used the best Dynamic Harmonic Regression model to forecast the next 30 days (April 2024) and compared the forecasts against actual Citi Bike rides.
final_average_df <- read.csv("final_daily_nyc_df.csv")
daily_ts <- ts(final_average_df$number_of_rides, start = c(2022, 60), frequency = 365.25)
The meaning of frequency:
Frequency = 365.25: forecasting functions treat the data as daily observations with a yearly cycle, accounting for leap years in long-term predictions. This affects how trend and seasonality are forecast.
Frequency = 1: the data are treated as purely sequential with no seasonal or cyclic pattern, so forecasts will not model any periodic changes.
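A minimal base-R sketch of how `frequency` changes a `ts` object's calendar (March 1, 2022 is day 60 of the year, hence `start = c(2022, 60)`):

```r
# Daily data with a yearly cycle: time is measured in fractional years.
x <- ts(1:10, start = c(2022, 60), frequency = 365.25)
frequency(x)   # 365.25
time(x)[1]     # 2022 + 59/365.25, the fractional-year position of day 60

# Purely sequential data: no seasonal structure at all.
y <- ts(1:10, start = 1, frequency = 1)
frequency(y)   # 1
```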
print(head(daily_ts))
## [1] 57432 68718 56795 51059 52695 55948
print(length(daily_ts))
## [1] 762
plot(daily_ts, xlab = "Time", ylab = "Number of Rides", main = "Daily Bike Rides from March 2022 to March 2024")
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
lambda <- BoxCox.lambda(daily_ts)
daily_ts_boxcox <- BoxCox(daily_ts, lambda)
plot(daily_ts_boxcox, main = paste("Box-Cox Transformed Time Series Data, Lambda =", round(lambda, 2)), xlab = "Time", ylab = "Transformed Bike Rides")
lambda = 0.37: the estimated lambda is well below 1, so a Box-Cox transformation is needed to stabilize the variance.
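For reference, the Box-Cox transform applied by `BoxCox()` reduces to a simple formula; a minimal base-R sketch:

```r
# Box-Cox: y(lambda) = (y^lambda - 1) / lambda for lambda != 0,
# and log(y) in the limiting case lambda = 0.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

box_cox(100, 1)   # 99: lambda = 1 leaves the shape unchanged (just shifted)
box_cox(100, 0)   # log(100), about 4.605
```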
library(forecast)
Acf(daily_ts, main="ACF for Daily Demand of Bike Rides")
Pacf(daily_ts, main="PACF for Daily Demand of Bike Rides")
Observations:
bike_decompose <- decompose(daily_ts)
plot(bike_decompose)
library(forecast)
bike_decompose <- stl(daily_ts, s.window="periodic")
plot(bike_decompose)
plot(bike_decompose$time.series[, "trend"], main="Trend Component")
plot(bike_decompose$time.series[, "seasonal"], main="Seasonal Component")
plot(bike_decompose$time.series[, "remainder"], main="Residual Component")
library(tseries)
kpss_test_original <- kpss.test(daily_ts)
print("KPSS Test for Original Data:")
## [1] "KPSS Test for Original Data:"
print(kpss_test_original)
##
## KPSS Test for Level Stationarity
##
## data: daily_ts
## KPSS Level = 0.63784, Truncation lag parameter = 6, p-value = 0.0192
p-value = 0.02 < 0.05: I reject the KPSS null hypothesis that the series is level-stationary, indicating non-stationarity due to the increasing trend (trend differencing with d = 1 may be needed).
library(tseries)
adf_test <- adf.test(daily_ts, alternative = "stationary")
adf_test$p.value
## [1] 0.07014074
A larger p-value (0.07 > 0.05) means I fail to reject the ADF null hypothesis of a unit root, so I conclude that the series is non-stationary.
library(tseries)
daily_ts_seasonal_diff1 <- diff(daily_ts, lag=7)
kpss_test <- kpss.test(daily_ts_seasonal_diff1)
## Warning in kpss.test(daily_ts_seasonal_diff1): p-value greater than printed
## p-value
print("KPSS Test for Weekly Seasonality:")
## [1] "KPSS Test for Weekly Seasonality:"
print(kpss_test)
##
## KPSS Test for Level Stationarity
##
## data: daily_ts_seasonal_diff1
## KPSS Level = 0.046564, Truncation lag parameter = 6, p-value = 0.1
adf_test <- adf.test(daily_ts_seasonal_diff1, alternative = "stationary")
## Warning in adf.test(daily_ts_seasonal_diff1, alternative = "stationary"):
## p-value smaller than printed p-value
print("ADF Test P-value for Weekly Seasonality:")
## [1] "ADF Test P-value for Weekly Seasonality:"
adf_test$p.value
## [1] 0.01
The time series is stationary after weekly seasonal differencing.
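Seasonal differencing at lag 7 simply subtracts the value from one week earlier; a minimal base-R sketch on a toy series with a stable weekly pattern:

```r
# Two weeks with the same weekly shape, the second shifted up by 2.
x <- c(10, 20, 30, 40, 50, 60, 70,   # week 1
       12, 22, 32, 42, 52, 62, 72)  # week 2

# Each difference compares a day with the same weekday one week earlier,
# removing the weekly pattern and leaving only the level shift.
diff(x, lag = 7)   # seven values, all equal to 2
```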
library(tseries)
daily_ts_seasonal_diff2 <- diff(daily_ts, lag=28)
kpss_test <- kpss.test(daily_ts_seasonal_diff2)
print("KPSS Test for Monthly Seasonality:")
## [1] "KPSS Test for Monthly Seasonality:"
print(kpss_test)
##
## KPSS Test for Level Stationarity
##
## data: daily_ts_seasonal_diff2
## KPSS Level = 0.40645, Truncation lag parameter = 6, p-value = 0.07437
adf_test <- adf.test(daily_ts_seasonal_diff2, alternative = "stationary")
## Warning in adf.test(daily_ts_seasonal_diff2, alternative = "stationary"):
## p-value smaller than printed p-value
print("ADF Test P-value for Monthly Seasonality:")
## [1] "ADF Test P-value for Monthly Seasonality:"
adf_test$p.value
## [1] 0.01
The time series is stationary after monthly seasonal differencing.
library(tseries)
daily_ts_seasonal_diff3 <- diff(daily_ts, lag=365)
kpss_test <- kpss.test(daily_ts_seasonal_diff3)
## Warning in kpss.test(daily_ts_seasonal_diff3): p-value greater than printed
## p-value
print("KPSS Test for Annual Seasonality:")
## [1] "KPSS Test for Annual Seasonality:"
print(kpss_test)
##
## KPSS Test for Level Stationarity
##
## data: daily_ts_seasonal_diff3
## KPSS Level = 0.040434, Truncation lag parameter = 5, p-value = 0.1
adf_test <- adf.test(daily_ts_seasonal_diff3, alternative = "stationary")
## Warning in adf.test(daily_ts_seasonal_diff3, alternative = "stationary"):
## p-value smaller than printed p-value
print("ADF Test P-value for Annual Seasonality:")
## [1] "ADF Test P-value for Annual Seasonality:"
adf_test$p.value
## [1] 0.01
The time series is stationary after annual seasonal differencing.
Conclusion: the daily time series may have multiple seasonalities (weekly, monthly, and annual).
arima_model_1 <- auto.arima(daily_ts, seasonal = TRUE, lambda = 0.37)
## Warning: The time series frequency has been rounded to support seasonal
## differencing.
summary(arima_model_1)
## Series: daily_ts
## ARIMA(1,0,0)(0,1,0)[365] with drift
## Box Cox transformation: lambda= 0.37
##
## Coefficients:
## ar1 drift
## 0.2949 0.0270
## s.e. 0.0479 0.0044
##
## sigma^2 = 508.8: log likelihood = -1799.3
## AIC=3604.6 AICc=3604.66 BIC=3616.55
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 159.6668 19900.41 10453.97 -3.894772 14.25003 0.4393484 0.02238712
The model recognized a significant seasonal pattern occurring every 365 days, reinforcing the choice of an annual period for seasonal treatment. However, SARIMA can only capture a single seasonality.
ARIMA(1,0,0)(0,1,0)[365] with drift: AICc=3604.66
checkresiduals(arima_model_1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(0,1,0)[365] with drift
## Q* = 358.49, df = 151, p-value < 2.2e-16
##
## Model df: 1. Total lags used: 152
The small Ljung-Box p-value indicates significant autocorrelation in the residuals: the model has not captured much of the temporal structure in the data. The residual ACF also shows many significant lags. This SARIMA model is underfitting.
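The Ljung-Box statistic used by `checkresiduals()` is also available directly as base R's `Box.test()`; a minimal sketch on synthetic data (the null hypothesis is that the first `lag` autocorrelations are jointly zero):

```r
set.seed(42)
wn <- rnorm(500)                                     # white noise: should pass
bt <- Box.test(wn, lag = 10, type = "Ljung-Box")

ar <- stats::filter(wn, 0.8, method = "recursive")   # strongly autocorrelated AR(1)
bt_ar <- Box.test(ar, lag = 10, type = "Ljung-Box")

# bt$p.value is typically large; bt_ar$p.value is essentially zero,
# mirroring the rejection seen for the SARIMA residuals above.
```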
library(fpp)
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(TSA)
## Registered S3 methods overwritten by 'TSA':
## method from
## fitted.Arima forecast
## plot.Arima forecast
##
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
eacf(daily_ts)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x x x x
## 1 x x o o o o x o o o o o o x
## 2 x o o x o o x x o o o o o o
## 3 x o o o o o x o o o o o o o
## 4 x x x o o o x o o o o o o o
## 5 x x x x o x x o o o o o o o
## 6 x x x x x x x o o o o o o o
## 7 o x x x x x x o x o o o o o
Note: eacf() only identifies the non-seasonal ARMA(p, q) orders; the seasonal part of the model is chosen separately.
arima_model_2 <- Arima(daily_ts, order=c(2,0,1), seasonal=c(0,1,0), include.drift=TRUE, lambda = 0.37)
summary(arima_model_2)
## Series: daily_ts
## ARIMA(2,0,1)(0,1,0)[365] with drift
## Box Cox transformation: lambda= 0.37
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ma1 drift
## 0.1428 0.033 0.1562 0.0270
## s.e. NaN NaN NaN 0.0043
##
## sigma^2 = 511: log likelihood = -1799.26
## AIC=3608.53 AICc=3608.68 BIC=3628.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 157.1601 19896.73 10455.72 -3.895559 14.25095 0.4394219 0.0178957
AICc=3608.68
checkresiduals(arima_model_2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,1)(0,1,0)[365] with drift
## Q* = 356.75, df = 149, p-value < 2.2e-16
##
## Model df: 3. Total lags used: 152
arima_model_3 <- Arima(daily_ts, order=c(1,1,0), seasonal=c(0,1,0), lambda = 0.37)
summary(arima_model_3)
## Series: daily_ts
## ARIMA(1,1,0)(0,1,0)[365]
## Box Cox transformation: lambda= 0.37
##
## Coefficients:
## ar1
## -0.3421
## s.e. 0.0472
##
## sigma^2 = 692.5: log likelihood = -1859.9
## AIC=3723.8 AICc=3723.83 BIC=3731.76
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -715.9327 24210.33 12734.49 -4.529659 16.44589 0.5351916
## ACF1
## Training set -0.03113498
AICc=3723.83
arima_model_4 <- Arima(daily_ts, order=c(1,0,1), seasonal=c(0,1,0), include.drift=TRUE, lambda = 0.37)
summary(arima_model_4)
## Series: daily_ts
## ARIMA(1,0,1)(0,1,0)[365] with drift
## Box Cox transformation: lambda= 0.37
##
## Coefficients:
## ar1 ma1 drift
## 0.2535 0.0455 0.0270
## s.e. 0.1608 0.1659 0.0043
##
## sigma^2 = 509.7: log likelihood = -1799.26
## AIC=3606.52 AICc=3606.63 BIC=3622.46
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 154.3736 19896.67 10456.62 -3.898429 14.25234 0.4394598 0.01773766
AICc=3606.63
arima_model_5 <- Arima(daily_ts, order=c(1,0,2), seasonal=c(0,1,0), include.drift=TRUE, lambda = 0.37)
summary(arima_model_5)
## Series: daily_ts
## ARIMA(1,0,2)(0,1,0)[365] with drift
## Box Cox transformation: lambda= 0.37
##
## Coefficients:
## ar1 ma1 ma2 drift
## 0.1787 0.1201 0.0248 0.0270
## s.e. 0.4916 0.4903 0.1535 0.0043
##
## sigma^2 = 511: log likelihood = -1799.26
## AIC=3608.51 AICc=3608.66 BIC=3628.43
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 150.8366 19895.72 10460.67 -3.901439 14.25689 0.4396298 0.01765922
AICc=3608.66
arima_model_6 <- Arima(daily_ts, order=c(1,0,0), seasonal=c(0,1,0), lambda = 0.37)
summary(arima_model_6)
## Series: daily_ts
## ARIMA(1,0,0)(0,1,0)[365]
## Box Cox transformation: lambda= 0.37
##
## Coefficients:
## ar1
## 0.3993
## s.e. 0.0459
##
## sigma^2 = 548.4: log likelihood = -1814.85
## AIC=3633.7 AICc=3633.73 BIC=3641.66
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 4265.372 20349.81 11200.93 0.7656516 14.27421 0.4707407
## ACF1
## Training set -0.02986718
AICc=3633.73
Conclusion: the SARIMA model with the lowest AICc is ARIMA(1,0,0)(0,1,0)[365] with drift (AICc=3604.66).
checkresiduals(arima_model_1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(0,1,0)[365] with drift
## Q* = 358.49, df = 151, p-value < 2.2e-16
##
## Model df: 1. Total lags used: 152
shapiro.test(arima_model_1$residuals)
##
## Shapiro-Wilk normality test
##
## data: arima_model_1$residuals
## W = 0.77065, p-value < 2.2e-16
Ljung-Box test: p-value < 2.2e-16 -> residuals are autocorrelated.
Shapiro-Wilk normality test: p-value < 2.2e-16 -> residuals are not normally distributed.
qqnorm(arima_model_1$residuals,main=expression(Normal~~Q-Q~~Plot))
qqline(arima_model_1$residuals)
Conclusion with SARIMA Model:
The best model is ARIMA(1,0,0)(0,1,0)[365] with drift with lowest AICc=3604.66.
However, this model’s residuals are autocorrelated and not normally distributed: there are systematic patterns in the data that the model fails to capture.
Therefore, the model is underfitting; a likely reason is that ARIMA/SARIMA models can only capture a single seasonality.
Next, I examine multivariate and multi-seasonality models.
final_average_df <- read.csv("final_daily_nyc_df.csv")
head(final_average_df)
## date tmpf dwpf relh drct sknt p01i alti
## 1 2022-03-01 40.50000 23.32917 51.52625 156.4286 5.000000 0.000000000 30.13625
## 2 2022-03-02 45.58333 27.49583 50.73917 208.3333 4.636364 0.000000000 29.97792
## 3 2022-03-03 37.93462 15.85000 43.33346 293.5000 8.760000 0.001363636 30.08000
## 4 2022-03-04 29.07391 4.40000 35.18652 205.8333 4.666667 0.000000000 30.50174
## 5 2022-03-05 39.43478 19.26522 44.81957 65.0000 3.272727 0.000000000 30.47130
## 6 2022-03-06 54.00000 46.23684 75.53211 131.4286 4.529412 0.006875000 30.06237
## mslp vsby feel number_of_rides
## 1 1019.704 10.000000 36.81583 57432
## 2 1014.321 9.708333 42.85409 68718
## 3 1018.418 9.884615 31.71600 56795
## 4 1032.178 10.000000 23.36524 51059
## 5 1031.030 10.000000 36.90227 52695
## 6 1016.413 7.684211 53.42237 55948
tail(final_average_df)
## date tmpf dwpf relh drct sknt p01i
## 757 2024-03-26 44.87027 28.58919 53.35378 57.60000 5.694444 0.0000000000
## 758 2024-03-27 46.51053 39.76842 77.60816 65.58824 2.972973 0.0011764706
## 759 2024-03-28 48.58710 44.85484 86.95145 109.38776 3.066667 0.0465217391
## 760 2024-03-29 46.70833 20.83333 36.32000 304.21053 9.476190 0.0000000000
## 761 2024-03-30 50.04167 25.08333 41.03875 281.87500 7.391304 0.0004761905
## 762 2024-03-31 54.65385 33.07692 46.50577 195.38462 4.238095 0.0020833333
## alti mslp vsby feel number_of_rides
## 757 30.22973 1022.600 9.972973 41.93108 105245
## 758 30.11184 1019.038 7.631579 44.82135 95913
## 759 29.98113 1013.583 5.387097 47.41683 55955
## 760 29.77167 1007.329 10.000000 42.73909 99427
## 761 29.78792 1007.929 10.000000 47.66042 97784
## 762 29.89192 1011.275 9.557692 54.28192 93274
final_average_df$date <- as.Date(final_average_df$date)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data_matrix <- final_average_df %>%
select(-date) %>%
as.matrix()
multivar_ts <- ts(data_matrix, start = c(2022, 60), frequency = 365.25)
str(multivar_ts)
## Time-Series [1:762, 1:11] from 2022 to 2024: 40.5 45.6 37.9 29.1 39.4 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:11] "tmpf" "dwpf" "relh" "drct" ...
correlations <- cor(multivar_ts[, -11], multivar_ts[, 11])
print(correlations)
## [,1]
## tmpf 0.738914694
## dwpf 0.561297552
## relh -0.127238016
## drct -0.253614088
## sknt -0.520515844
## p01i -0.256448227
## alti 0.019352370
## mslp -0.009804981
## vsby 0.335498160
## feel 0.740480305
library(forecast)
# Chose the 7-regressor combination from linear regression with adjusted R-squared 0.7003.
regressors1 <- c('tmpf', 'dwpf', 'sknt', 'feel', 'vsby', 'drct', 'p01i')
# Apply seasonal differencing to each regressor
seasonal_lag <- 365
diffed_regressors1 <- apply(multivar_ts[, regressors1], 2, function(x) diff(x, lag = seasonal_lag))
rides <- multivar_ts[, "number_of_rides"]   # response series
rides_aligned <- rides[-(1:seasonal_lag)]   # drop the first year to align with the differenced regressors
check_stationarity <- function(ts_data) {
adf_result <- adf.test(ts_data)
kpss_result <- kpss.test(ts_data)
return(list(adf_p_value = adf_result$p.value, kpss_p_value = kpss_result$p.value))
}
for (i in 1:ncol(diffed_regressors1)) {
regressor_name <- colnames(diffed_regressors1)[i]
regressor_data <- diffed_regressors1[, i]
test_results <- check_stationarity(regressor_data)
cat("Stationarity tests for:", regressor_name, "\n")
cat("ADF test p-value:", test_results$adf_p_value, "\n")
cat("KPSS test p-value:", test_results$kpss_p_value, "\n\n")
}
## Warning in adf.test(ts_data): p-value smaller than printed p-value
## Warning in kpss.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: tmpf
## ADF test p-value: 0.01
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value
## Warning in kpss.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: dwpf
## ADF test p-value: 0.01
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value
## Warning in kpss.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: sknt
## ADF test p-value: 0.01
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value
## Warning in kpss.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: feel
## ADF test p-value: 0.01
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value
## Warning in kpss.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: vsby
## ADF test p-value: 0.01
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value
## Warning in kpss.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: drct
## ADF test p-value: 0.01
## KPSS test p-value: 0.1
## Warning in adf.test(ts_data): p-value smaller than printed p-value
## Warning in kpss.test(ts_data): p-value greater than printed p-value
## Stationarity tests for: p01i
## ADF test p-value: 0.01
## KPSS test p-value: 0.1
reg_arima_1 <- auto.arima(rides_aligned, xreg = diffed_regressors1, lambda = "auto", seasonal = TRUE)
print(summary(reg_arima_1))
## Series: rides_aligned
## Regression with ARIMA(0,1,0) errors
## Box Cox transformation: lambda= 1.568598
##
## Coefficients:
## drift tmpf dwpf sknt feel vsby drct
## 13394.28 -40362.58 67617.50 -650519.8 122065.2 2102166.4 15233.571
## s.e. 810780.02 672445.05 94865.52 337885.8 569340.3 378790.3 7433.018
## p01i
## -51412376
## s.e. 33841961
##
## sigma^2 = 1.833e+14: log likelihood = -7060.59
## AIC=14139.19 AICc=14139.66 BIC=14175.02
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 198.9138 20761.6 15033.97 -3.119669 20.06502 0.902888 -0.2406174
print(checkresiduals(reg_arima_1))
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(0,1,0) errors
## Q* = 62.261, df = 10, p-value = 1.35e-09
##
## Model df: 0. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(0,1,0) errors
## Q* = 62.261, df = 10, p-value = 1.35e-09
AICc=14139.66; the Ljung-Box p-value of 1.35e-09 indicates that the residuals are autocorrelated.
The residual ACF shows significant spikes at lags 1, 7, and 21, so weekly seasonality remains in the residuals. This motivates a multi-seasonality model.
temp <- periodogram(daily_ts)
max_freq1 <- temp$freq[which.max(temp$spec)]
seasonality1 <- 1/max_freq1
print("First Seasonality: ")
## [1] "First Seasonality: "
print(seasonality1)
## [1] 384
modified_spec <- temp$spec
modified_spec[which.max(temp$spec)] <- -Inf
max_freq2 <- temp$freq[which.max(modified_spec)]
seasonality2 <- 1/max_freq2
print("Second Seasonality: ")
## [1] "Second Seasonality: "
print(seasonality2)
## [1] 6.981818
# Extract top 5 frequencies based on highest spectral power
top_freqs <- temp$freq[order(temp$spec, decreasing = TRUE)[1:5]]
top_specs <- sort(temp$spec, decreasing = TRUE)[1:5]
seasonalities <- 1 / top_freqs
results <- data.frame(Frequency = top_freqs, Spectral_Power = top_specs, Seasonality = seasonalities)
print(results)
## Frequency Spectral_Power Seasonality
## 1 0.002604167 371461518563 384.000000
## 2 0.143229167 32730774449 6.981818
## 3 0.001302083 21049688114 768.000000
## 4 0.003906250 14254414458 256.000000
## 5 0.007812500 11915750567 128.000000
Since the strongest period is 384 days and the second strongest is about 7 days, the data show both annual and weekly seasonality. (The 384-day value is an artifact of the periodogram's discrete frequency grid: the series is zero-padded to 768 points, so the closest representable period to 365 days is 768/2 = 384.)
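Under the hood, a periodogram is just the squared magnitude of the FFT; a minimal base-R sketch that recovers a known 7-day cycle from a synthetic series:

```r
set.seed(1)
n <- 700
x <- sin(2 * pi * (1:n) / 7) + rnorm(n, sd = 0.1)  # known period of 7 days

# Spectrum at frequencies k/n cycles per observation, k = 1 .. n/2 - 1
# (the zero-frequency term is dropped).
spec  <- Mod(fft(x))[2:(n %/% 2)]
freqs <- (1:(n %/% 2 - 1)) / n

period <- 1 / freqs[which.max(spec)]   # strongest period: 7
```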
library(forecast)
final_average_df <- read.csv("final_daily_nyc_df.csv")
daily_msts <- msts(final_average_df$number_of_rides, seasonal.periods=c(365.25, 7), start = c(2022, 60))
summary(daily_msts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9851 69159 97296 93831 119554 161586
library(forecast)
lambda <- BoxCox.lambda(daily_msts)
daily_msts_boxcox <- BoxCox(daily_msts, lambda)
plot(daily_msts_boxcox, main = paste("Box-Cox Transformed Time Series Data, Lambda =", round(lambda, 2)), xlab = "Time", ylab = "Transformed Bike Rides")
daily_msts %>% mstl() %>%
autoplot()
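The `fourier()` regressors used below are just sine/cosine pairs evaluated at each seasonal period; a minimal base-R sketch of the first weekly harmonic for a daily series of this length:

```r
n <- 762
t <- 1:n
S1_7 <- sin(2 * pi * 1 * t / 7)   # corresponds to fourier()'s "S1-7" column
C1_7 <- cos(2 * pi * 1 * t / 7)   # corresponds to "C1-7"

# K = c(2, 5) generates 2 such pairs for period 7 and 5 pairs for period
# 365.25, i.e. 2*2 + 2*5 = 14 regressor columns in total.
```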
xreg <- fourier(daily_msts, K=c(3, 4))
dh_regression1 <- auto.arima(daily_msts, xreg=xreg, seasonal=FALSE, lambda = 'auto')
summary(dh_regression1)
## Series: daily_msts
## Regression with ARIMA(1,1,1) errors
## Box Cox transformation: lambda= 0.3748346
##
## Coefficients:
## ar1 ma1 S1-7 C1-7 S2-7 C2-7 S3-7 C3-7
## 0.3549 -0.9854 6.6140 -5.1123 1.7794 0.8221 -0.5086 0.9097
## s.e. 0.0345 0.0050 0.9977 0.9980 0.7278 0.7287 0.6205 0.6217
## S1-365 C1-365 S2-365 C2-365 S3-365 C3-365 S4-365 C4-365
## 15.6286 -23.7366 5.6410 0.0743 -0.9852 2.5459 -3.8887 0.4182
## s.e. 1.8379 1.6996 1.4281 1.4028 1.3380 1.3405 1.3083 1.3137
##
## sigma^2 = 268.7: log likelihood = -3201.53
## AIC=6437.06 AICc=6437.88 BIC=6515.84
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 2837.729 18139.73 13687.02 -3.34914 19.44719 0.5752233 -0.02917148
checkresiduals(dh_regression1)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(1,1,1) errors
## Q* = 170.23, df = 150, p-value = 0.1236
##
## Model df: 2. Total lags used: 152
newharmonics <- fourier(daily_msts, K=c(2, 5), h=365)
fc <- forecast(dh_regression1, xreg=newharmonics)
## Warning in forecast.forecast_ARIMA(dh_regression1, xreg = newharmonics): xreg
## contains different column names from the xreg used in training. Please check
## that the regressors are in the same order.
autoplot(fc)
K=c(3, 4): passed Ljung-Box test, AICc=6437.88, MAE=13687.02
K=c(2, 5): passed Ljung-Box test, AICc=6433.83, MAE=13538.87
library(forecast)
bestfit<- list(aicc=Inf)
maxK_weekly <- 3 # Up to 3 because 7/2 = 3.5
maxK_annual <- 10
for (i in 1:maxK_weekly) {
for (j in 1:maxK_annual) {
xreg <- fourier(daily_msts, K=c(i, j))
fit <- auto.arima(daily_msts, xreg=xreg, seasonal=FALSE, lambda = 'auto')
if (fit$aicc < bestfit$aicc) {
bestfit <- fit
best_K <- c(i, j)
}
}
}
summary(bestfit)
## Series: daily_msts
## Regression with ARIMA(2,1,3) errors
## Box Cox transformation: lambda= 0.3748346
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 S1-7 C1-7 S2-7
## -0.4713 0.4633 -0.1709 -0.9754 0.1689 6.6281 -5.1119 1.7790
## s.e. 0.1152 0.1031 0.1288 0.0131 0.1228 0.9465 0.9467 0.7273
## C2-7 S1-365 C1-365 S2-365 C2-365 S3-365 C3-365 S4-365
## 0.8207 15.5534 -23.7351 5.5801 0.0903 -1.0444 2.5825 -3.9401
## s.e. 0.7282 1.8766 1.7340 1.4788 1.4529 1.3915 1.3940 1.3624
## C4-365 S5-365 C5-365
## 0.4771 0.9782 -2.3900
## s.e. 1.3677 1.3518 1.3493
##
## sigma^2 = 266: log likelihood = -3196.35
## AIC=6432.7 AICc=6433.83 BIC=6525.39
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2793.025 17984.71 13538.87 -3.313332 19.23899 0.5689971
## ACF1
## Training set -0.008145507
checkresiduals(bestfit)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(2,1,3) errors
## Q* = 158.07, df = 147, p-value = 0.2518
##
## Model df: 5. Total lags used: 152
print(best_K)
## [1] 2 5
shapiro.test(bestfit$residuals)
##
## Shapiro-Wilk normality test
##
## data: bestfit$residuals
## W = 0.90564, p-value < 2.2e-16
Shapiro-Wilk normality test: p-value < 2.2e-16. Residuals are not normally distributed.
qqnorm(bestfit$residuals,main=expression(Normal~~Q-Q~~Plot))
qqline(bestfit$residuals)
Best K values for weekly and annual seasonality: 2, 5 with ARIMA(2,1,3) Errors.
AICc=6433.83, MAE=13538.87.
This model passed the Ljung-Box test (p-value = 0.2518) but did not pass the normality test.
library(forecast)
tbats1 <- tbats(daily_msts)
print(tbats1)
## TBATS(1, {1,2}, -, {<7,2>, <365.25,5>})
##
## Call: tbats(y = daily_msts)
##
## Parameters
## Alpha: 0.01276081
## Gamma-1 Values: -0.00147551 -0.003423304
## Gamma-2 Values: 0.003033274 1.922459e-06
## AR coefficients: 0.580015
## MA coefficients: -0.236126 -0.071849
##
## Seed States:
## [,1]
## [1,] 82565.4798
## [2,] 1754.3544
## [3,] 2015.1356
## [4,] 9653.5158
## [5,] -583.5445
## [6,] -29223.8878
## [7,] 732.5216
## [8,] 2022.6759
## [9,] 101.6095
## [10,] -1993.4930
## [11,] 18893.1044
## [12,] 6275.8671
## [13,] -1684.3807
## [14,] -4156.8555
## [15,] 1795.4921
## [16,] 0.0000
## [17,] 0.0000
## [18,] 0.0000
##
## Sigma: 17684.81
## AIC: 20014.01
checkresiduals(tbats1)
##
## Ljung-Box test
##
## data: Residuals from TBATS
## Q* = 162.44, df = 152, p-value = 0.2664
##
## Model df: 0. Total lags used: 152
fc1 <- forecast(tbats1, h=365)
autoplot(fc1)
Model: TBATS(1, {1,2}, -, {<7,2>, <365.25,5>}).
The TBATS model selected the same numbers of harmonics (2 weekly, 5 annual) as the Dynamic Harmonic Regression, but with different ARMA errors {1,2}. This model passed the Ljung-Box test. AIC=20014.01 (not directly comparable to the AICc of the Box-Cox-transformed regression models, since the likelihoods are on different scales).
inference <- read.csv("intervention_analysis_data.csv")
inference_ts <- ts(inference$number_of_rides, start = c(2018, 1), frequency = 365.25)
print(head(inference_ts))
## Time Series:
## Start = 2018
## End = 2018.01368925394
## Frequency = 365.25
## [1] 5500 18818 24299 1922 4972 4295
print(length(inference_ts))
## [1] 2281
plot(inference_ts, xlab = "Time", ylab = "Number of Rides", main = "Daily Bike Rides from January 2018 to March 2024")
Conclusion: there is no Covid-19 intervention effect to be modeled.
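Had an intervention effect been present, it could have been modeled by passing a step or pulse dummy regressor to `Arima()` via `xreg`. A minimal sketch; the date range and change point here are illustrative assumptions, not values from the analysis:

```r
dates <- seq(as.Date("2018-01-01"), as.Date("2024-03-31"), by = "day")

# Step intervention: 0 before the change point, 1 from then on
# (a permanent level shift).
step_covid <- as.integer(dates >= as.Date("2020-03-01"))

# Pulse intervention: 1 on the change point only (a one-day shock).
pulse_covid <- as.integer(dates == as.Date("2020-03-01"))

# fit <- Arima(inference_ts, order = c(1, 0, 0), xreg = step_covid)
```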
library(forecast)
model_1 <- function(x, h) {
if (length(x) >= 365 && length(x) + h <= length(daily_msts)) {
# Generate Fourier terms for the current subset x directly from msts
x_fourier <- fourier(x, K=c(2, 5))
# Fit the ARIMA model using fixed parameters
fit <- Arima(x, order=c(2, 1, 3), seasonal=c(0, 0, 0), xreg=x_fourier)
# Generate Fourier terms for forecasting
future_fourier <- fourier(x, K=c(2, 5), h=h)
# Forecast using the fitted ARIMA model with the future Fourier terms
fc <- forecast(fit, xreg=future_fourier)
return(fc)
} else {
print("Insufficient data for fitting and forecasting.")
return(forecast(rep(NA, h), h=h))
}
}
harmo_error_exp <- tsCV(daily_msts, model_1, h=7, initial=365)
## [1] "Insufficient data for fitting and forecasting."
## [1] "Insufficient data for fitting and forecasting."
## [1] "Insufficient data for fitting and forecasting."
## [1] "Insufficient data for fitting and forecasting."
## [1] "Insufficient data for fitting and forecasting."
## [1] "Insufficient data for fitting and forecasting."
harmo_error_exp <- na.omit(harmo_error_exp)
# print(harmo_error_exp)
library(Metrics)
##
## Attaching package: 'Metrics'
## The following object is masked from 'package:forecast':
##
## accuracy
# Mean Error
mean_error <- mean(harmo_error_exp, na.rm = TRUE)
print(paste("Mean Error:", mean_error))
## [1] "Mean Error: 2946.60696016812"
# Standard Deviation of Error
std_error <- sd(harmo_error_exp, na.rm = TRUE)
print(paste("Standard Deviation of Error:", std_error))
## [1] "Standard Deviation of Error: 21705.5838029752"
# Mean Absolute Error (MAE)
mae_value <- mae(harmo_error_exp, rep(0, length(harmo_error_exp)))
print(paste("Mean Absolute Error (MAE):", mae_value))
## [1] "Mean Absolute Error (MAE): 16245.9762863509"
# Root Mean Squared Error (RMSE)
rmse_value <- rmse(harmo_error_exp, rep(0, length(harmo_error_exp)))
print(paste("Root Mean Squared Error (RMSE):", rmse_value))
## [1] "Root Mean Squared Error (RMSE): 21900.7370845512"
# Mean Squared Error (MSE)
mse_value <- mse(harmo_error_exp, rep(0, length(harmo_error_exp)))
print(paste("Mean Squared Error (MSE):", mse_value))
## [1] "Mean Squared Error (MSE): 479642284.846634"
# Median Absolute Error (MedAE)
medae_value <- median(abs(harmo_error_exp), na.rm = TRUE)
print(paste("Median Absolute Error (MedAE):", medae_value))
## [1] "Median Absolute Error (MedAE): 12061.0429664374"
future_fourier <- fourier(daily_msts, K=c(2, 5), h=30)
harmo_fc <- forecast(bestfit, xreg=future_fourier, level = c(80, 95))
print(harmo_fc)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2024.2493 94213.34 69652.06 123577.1 58475.24 141181.1
## 2024.2521 97550.66 71010.82 129554.9 59031.52 148846.8
## 2024.2548 104225.38 76254.72 137860.2 63595.91 158099.0
## 2024.2575 100769.97 73266.27 133958.6 60859.76 153972.9
## 2024.2603 100932.38 73385.88 134172.3 60959.93 154217.3
## 2024.2630 86476.82 61578.64 116869.5 50470.33 135330.1
## 2024.2658 84134.71 59685.67 114041.1 48799.52 132229.7
## 2024.2685 88985.74 63588.70 119927.2 52236.67 138698.5
## 2024.2712 102924.80 74955.68 136644.4 62328.34 156967.2
## 2024.2740 103952.29 75780.45 137897.2 63054.82 158348.6
## 2024.2767 108377.15 79417.46 143170.7 66300.18 164095.2
## 2024.2795 102083.16 74231.07 135689.7 61666.52 155955.0
## 2024.2822 93597.88 67301.54 125525.2 55508.83 144853.1
## 2024.2849 85599.31 60804.01 115907.4 49756.03 134332.2
## 2024.2877 96036.68 69273.12 128474.6 57250.88 148090.2
## 2024.2904 104844.03 76465.28 139029.6 63643.12 159622.7
## 2024.2932 111416.78 81872.90 146857.1 68471.38 168149.6
## 2024.2959 110557.35 81150.61 145855.1 67819.13 167070.4
## 2024.2986 109177.69 80012.76 144215.7 66801.76 165286.4
## 2024.3014 95747.09 68997.89 128184.8 56988.20 147806.9
## 2024.3041 91728.39 65727.88 123357.3 54089.31 142527.7
## 2024.3068 98372.55 71126.93 131351.3 58872.79 151277.5
## 2024.3096 111554.64 81943.24 147083.4 68513.76 168431.9
## 2024.3123 114139.79 84065.15 150171.0 70406.55 171801.1
## 2024.3151 117278.81 86658.46 153897.2 72728.28 175854.6
## 2024.3178 112021.75 82303.97 147673.9 68824.73 169094.9
## 2024.3205 101728.56 73844.09 135406.8 61276.51 155727.7
## 2024.3233 94417.15 67868.15 126657.1 55964.22 146176.5
## 2024.3260 104308.16 75945.33 138506.8 63141.81 159120.1
## 2024.3288 114742.01 84523.87 150941.5 70798.81 172671.2
plot(harmo_fc)
test_set <- read.csv("24April_test_rides.csv")
fc_values <- harmo_fc$mean
actual_values <- test_set$number_of_rides
comparison_df <- data.frame(
Actual = actual_values,
Forecasted = fc_values
)
print(comparison_df)
## Actual Forecasted
## 1 89230 94213.34
## 2 41395 97550.66
## 3 27031 104225.38
## 4 99708 100769.97
## 5 102359 100932.38
## 6 97087 86476.82
## 7 96966 84134.71
## 8 113957 88985.74
## 9 140252 102924.80
## 10 126407 103952.29
## 11 89245 108377.15
## 12 107309 102083.16
## 13 91502 93597.88
## 14 118128 85599.31
## 15 142162 96036.68
## 16 146951 104844.03
## 17 114294 111416.78
## 18 90995 110557.35
## 19 118962 109177.69
## 20 148800 95747.09
## 21 94132 91728.39
## 22 112278 98372.55
## 23 117106 111554.64
## 24 124140 114139.79
## 25 118147 117278.81
## 26 122802 112021.75
## 27 107860 101728.56
## 28 134159 94417.15
## 29 132321 104308.16
## 30 126078 114742.01
fc_values <- as.numeric(fc_values)
plot(actual_values, type = 'o', pch = 19, col = 'blue', ylim = c(0, max(c(actual_values, fc_values))),
xlab = "Days", ylab = "Number of Rides", main = "Comparison of Actual vs Forecasted Rides for April 2024",
xaxt = 'n')
lines(fc_values, type = 'o', pch = 19, col = 'red')
legend("topleft", legend = c("Actual", "Forecasted"), col = c("blue", "red"), pch = 19, lty = 1, cex = 0.8)
axis(1, at = 1:30, labels = 1:30)
Observations:
The model captured the general patterns in the data well, especially from day 21 to day 27. The forecasts for days 2 and 3 are relatively poor because of the unusually low actual demand on those two days.
I used two years of data for this project. Since there is no intervention effect from Covid, I plan to expand the time horizon to several more years so the models can better learn the annual patterns in the data.
The current best model, Dynamic Harmonic Regression, did not pass the residual normality test. I will experiment with more flexible models, such as neural networks, to address this.